Policy Iteration for Continuous-Time Average Reward Markov Decision Processes in Polish Spaces

Authors

  • Quanxin Zhu
  • Xinsong Yang
  • Chuangxia Huang
  • Nikolaos Papageorgiou
Abstract

(ii) $A$ is an action space, which is also assumed to be a Polish space, and $A(x)$ is a Borel set which denotes the set of available actions at state $x \in S$. The set $K := \{(x,a) : x \in S,\ a \in A(x)\}$ is assumed to be a Borel subset of $S \times A$.

(iii) $q(\cdot \mid x,a)$ denotes the transition rates, and they are assumed to satisfy the following properties: for each $(x,a) \in K$ and $D \in \mathcal{B}(S)$,

(Q1) $D \mapsto q(D \mid x,a)$ is a signed measure on $\mathcal{B}(S)$, and $(x,a) \mapsto q(D \mid x,a)$ is Borel measurable on $K$;

(Q2) $0 \le q(D \mid x,a) < \infty$ for all $x \notin D \in \mathcal{B}(S)$;

(Q3) $q(S \mid x,a) = 0$ and $0 \le -q(\{x\} \mid x,a) < \infty$;

(Q4) $q^*(x) := \sup_{a \in A(x)} [-q(\{x\} \mid x,a)] < \infty$ for all $x \in S$.

It should be noted that property (Q3) shows that the model is conservative, and property (Q4) implies that the model is stable.

(iv) $r(x,a)$ denotes the reward rate and is assumed to be measurable on $K$. Since $r(x,a)$ is allowed to take positive and negative values, it can also be interpreted as a cost rate.

To introduce the optimal control problem that we are interested in, we need the classes of admissible control policies. Let $\Pi_m$ be the family of functions $\pi_t(B \mid x)$ such that (i) for each $x \in S$ and $t \ge 0$, $B \mapsto \pi_t(B \mid x)$ is a probability measure on $\mathcal{B}(A(x))$, and (ii) for each $x \in S$ and $B \in \mathcal{B}(A(x))$, $t \mapsto \pi_t(B \mid x)$ is a Borel measurable function on $[0,\infty)$.

Definition 2.1. A family $\pi = (\pi_t,\ t \ge 0) \in \Pi_m$ is said to be a randomized Markov policy. In particular, if there exists a measurable function $f$ on $S$ with $f(x) \in A(x)$ for all $x \in S$ such that $\pi_t(\{f(x)\} \mid x) \equiv 1$ for all $t \ge 0$ and $x \in S$, then $\pi$ is called a (deterministic) stationary policy and is identified with $f$. The set of all stationary policies is denoted by $F$.

For each $\pi = (\pi_t,\ t \ge 0) \in \Pi_m$, we define the associated transition rates $q(D \mid x,\pi_t)$ and reward rates $r(x,\pi_t)$, respectively, as follows: for each $x \in S$, $D \in \mathcal{B}(S)$ and $t \ge 0$,

$$q(D \mid x,\pi_t) := \int_{A(x)} q(D \mid x,a)\,\pi_t(da \mid x), \qquad r(x,\pi_t) := \int_{A(x)} r(x,a)\,\pi_t(da \mid x). \tag{2.2}$$

In particular, we write $q(D \mid x,\pi_t)$ and $r(x,\pi_t)$ as $q(D \mid x,f)$ and $r(x,f)$, respectively, when $\pi := f \in F$.

Definition 2.2. A randomized Markov policy $\pi$ is said to be admissible if $q(D \mid x,\pi_t)$ is continuous in $t \ge 0$ for all $D \in \mathcal{B}(S)$ and $x \in S$. The family of all such policies is denoted by $\Pi$. Obviously, $\Pi \supseteq F$, so $\Pi$ is nonempty. Moreover, for each $\pi \in \Pi$, Lemma 2.1 in [16] ensures that there exists a $Q$-process, that is, a possibly substochastic and nonhomogeneous transition function $P^{\pi}(s,x,t,D)$ with transition rates $q(D \mid x,\pi_t)$. As is well known, such a $Q$-process is not necessarily regular; that is, we might have $P^{\pi}(s,x,t,S) < 1$ for some state $x \in S$ and $t \ge s \ge 0$. To ensure the regularity of a $Q$-process, we use the following so-called "drift" condition, which is taken from [14, 16–18].

Assumption A. There exist a measurable function $w_1 \ge 1$ on $S$ and constants $b_1 \ge 0$, $c_1 > 0$, $M_1 > 0$ and $M > 0$ such that

(1) $\int_S w_1(y)\,q(dy \mid x,a) \le -c_1 w_1(x) + b_1$ for all $(x,a) \in K$;

(2) $q^*(x) \le M_1 w_1(x)$ for all $x \in S$, with $q^*(x)$ as in (Q4);

(3) $|r(x,a)| \le M w_1(x)$ for all $(x,a) \in K$.

Remark 2.1 in [16] gives a discussion of Assumption A. In fact, Assumption A(1) is similar to conditions in the previous literature (see, e.g., [19, equation (2.4)]) and, together with Assumption A(3), it is used to ensure the finiteness of the average expected reward criterion (2.5) below. In particular, Assumption A(2) is not required when the transition rates are uniformly bounded, that is, when $\sup_{x \in S} q^*(x) < \infty$.
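To make the primitives above concrete, the following Python sketch (not from the paper) builds a toy finite-state, finite-action model in which the transition rates $q(\{y\} \mid x,a)$ form a conservative generator for each action, and numerically checks properties (Q3)-(Q4) together with the drift inequality of Assumption A(1). The rate matrices, reward rates, weight function $w_1$, and constants $c_1$, $b_1$ are all illustrative assumptions.

```python
import numpy as np

# Toy finite model (illustrative, not from the paper): 3 states, 2 actions.
# q[a][x, y] is the transition rate q({y} | x, a); each row sums to 0 (property Q3).
q = np.array([
    [[-2.0,  1.5,  0.5],   # action 0
     [ 1.0, -3.0,  2.0],
     [ 0.5,  0.5, -1.0]],
    [[-1.0,  0.7,  0.3],   # action 1
     [ 0.4, -1.5,  1.1],
     [ 0.2,  0.8, -1.0]],
])
r = np.array([[1.0, 0.5],  # reward rates r(x, a)
              [2.0, 1.5],
              [0.5, 0.2]])
w1 = np.array([1.0, 2.0, 1.5])     # assumed weight function w1 >= 1

# (Q3) conservative: q(S | x, a) = 0 for every (x, a).
assert np.allclose(q.sum(axis=2), 0.0)

# (Q4) stable: q*(x) = sup_a [-q({x} | x, a)] is finite (trivially so here).
q_star = np.max(-np.diagonal(q, axis1=1, axis2=2), axis=0)
print("q*(x) =", q_star)

# Assumption A(1): sum_y w1(y) q({y} | x, a) <= -c1 * w1(x) + b1 for all (x, a).
c1, b1 = 0.1, 5.0                       # candidate constants (assumed)
drift = np.einsum('y,axy->xa', w1, q)   # drift[x, a] = sum_y w1(y) q({y} | x, a)
print("drift condition holds:", bool(np.all(drift <= -c1 * w1[:, None] + b1)))
```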
For each initial state $x \in S$ at time $s \ge 0$ and each $\pi \in \Pi$, we denote by $P^{\pi}_{s,x}$ and $E^{\pi}_{s,x}$ the probability measure determined by $P^{\pi}(s,x,t,D)$ and the corresponding expectation operator, respectively. Thus, for each $\pi \in \Pi$, by [20, pages 107–109] there exists a Borel measurable Markov process $\{x^{\pi}_t\}$ (we write $\{x^{\pi}_t\}$ as $\{x_t\}$ for simplicity when there is no risk of confusion) with values in $S$ and transition function $P^{\pi}(s,x,t,D)$, which is completely determined by the transition rates $q(D \mid x,\pi_t)$. In particular, if $s = 0$, we write $E^{\pi}_{0,x}$ and $P^{\pi}_{0,x}$ as $E^{\pi}_x$ and $P^{\pi}_x$, respectively. If Assumption A holds, then from [17, Lemma 3.1] we have the following facts.

Lemma 2.3. Suppose that Assumption A holds. Then the following statements hold.

(a) For each $x \in S$, $\pi \in \Pi$ and $t \ge 0$,

$$E^{\pi}_x[w_1(x_t)] \le e^{-c_1 t} w_1(x) + \frac{b_1}{c_1}, \tag{2.3}$$

where the function $w_1$ and the constants $b_1$ and $c_1$ are as in Assumption A.

(b) For each $u \in B_{w_1}(S)$, $x \in S$ and $\pi \in \Pi$,

$$\lim_{t \to \infty} \frac{E^{\pi}_x[u(x_t)]}{t} = 0. \tag{2.4}$$

For each $x \in S$ and $\pi \in \Pi$, the expected average reward $V(x,\pi)$ and the corresponding optimal reward value function $V^*(x)$ are defined as

$$V(x,\pi) := \liminf_{T \to \infty} \frac{1}{T} \int_0^T E^{\pi}_x[r(x_t,\pi_t)]\,dt, \qquad V^*(x) := \sup_{\pi \in \Pi} V(x,\pi). \tag{2.5}$$

As a consequence of Assumption A(3) and Lemma 2.3(a), the expected average reward $V(x,\pi)$ is well defined.

Definition 2.4. A policy $\pi^* \in \Pi$ is said to be average optimal if $V(x,\pi^*) = V^*(x)$ for all $x \in S$. The main goal of this paper is to give conditions ensuring that the policy iteration algorithm converges.

3. Optimality Conditions and Preliminaries

In this section we state conditions ensuring that the policy iteration algorithm (PIA) converges and give some preliminary lemmas that are needed to prove our main results. To guarantee that the PIA converges, we need to establish the average reward optimality equation. To do this, in addition to Assumption A, we need two more assumptions. The first is the following standard continuity-compactness hypothesis, which is taken from [14, 16–18]; it is similar to the version for discrete-time MDPs, see, for instance, [3, 8, 21–23] and their references. In particular, Assumption B(3) is not required when the transition rates are uniformly bounded, since it is only used to justify the application of the Dynkin formula.

Assumption B. For each $x \in S$,

(1) $A(x)$ is compact;

(2) $r(x,a)$ is continuous in $a \in A(x)$, and the function $\int_S u(y)\,q(dy \mid x,a)$ is continuous in $a \in A(x)$ for each bounded measurable function $u$ on $S$, and also for $u := w_1$ as in Assumption A;

(3) there exist a nonnegative measurable function $w_2$ on $S$ and constants $b_2 \ge 0$, $c_2 > 0$ and $M_2 > 0$ such that

$$q^*(x)\, w_1(x) \le M_2 w_2(x), \qquad \int_S w_2(y)\,q(dy \mid x,a) \le c_2 w_2(x) + b_2 \tag{3.1}$$

for all $(x,a) \in K$.

The second is the irreducibility and uniform exponential ergodicity condition. To state this condition, we need the weighted norm used in [8, 14, 22]. For the function $w_1 \ge 1$ in Assumption A, we define the weighted supremum norm $\|\cdot\|_{w_1}$ for real-valued functions $u$ on $S$ by

$$\|u\|_{w_1} := \sup_{x \in S} \left[ w_1(x)^{-1} |u(x)| \right] \tag{3.2}$$

and the Banach space

$$B_{w_1}(S) := \{ u : \|u\|_{w_1} < \infty \}. \tag{3.3}$$
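As a small numerical illustration of the average reward criterion (2.5) for a stationary policy, the sketch below forms the generator $Q_f = q(\cdot \mid \cdot, f)$ of a deterministic stationary policy $f$ on a finite model such as the toy example above, solves $\mu_f Q_f = 0$ with $\mu_f(S) = 1$ for the invariant probability measure, and returns $g(f) = \int_S r(y,f)\,\mu_f(dy)$, which under ergodicity equals $V(x,f)$ for every $x$. This is a minimal sketch under those finiteness assumptions; the arrays q, r and the policy f are placeholders, not constructions from the paper.

```python
import numpy as np

def average_reward(q, r, f):
    """Long-run average reward g(f) of a deterministic stationary policy f
    on a finite CTMDP (illustrative sketch).

    q[a][x, y] : transition rates q({y} | x, a), conservative in each row
    r[x, a]    : reward rates
    f[x]       : action chosen at state x
    """
    n = q.shape[1]
    Qf = np.stack([q[f[x], x, :] for x in range(n)])   # generator under f
    rf = np.array([r[x, f[x]] for x in range(n)])      # reward vector under f
    # Invariant probability measure: mu Qf = 0 and mu 1 = 1.
    A = np.vstack([Qf.T, np.ones(n)])
    b = np.concatenate([np.zeros(n), [1.0]])
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(mu @ rf), mu

# Usage with the toy arrays q, r defined in the previous sketch:
# g, mu = average_reward(q, r, f=np.array([0, 1, 0]))
```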
Definition 3.1. For each $f \in F$, the Markov process $\{x_t\}$, with transition rates $q(\cdot \mid x,f)$, is said to be uniformly $w_1$-exponentially ergodic if there exists an invariant probability measure $\mu_f$ on $S$ such that

$$\sup_{f \in F} \left| E^{f}_x[u(x_t)] - \mu_f(u) \right| \le R\, e^{-\rho t}\, \|u\|_{w_1} w_1(x) \tag{3.4}$$

for all $x \in S$, $u \in B_{w_1}(S)$ and $t \ge 0$, where the positive constants $R$ and $\rho$ do not depend on $f$, and where $\mu_f(u) := \int_S u(y)\,\mu_f(dy)$.

Assumption C. For each $f \in F$, the Markov process $\{x_t\}$, with transition rates $q(\cdot \mid x,f)$, is uniformly $w_1$-exponentially ergodic and $\lambda$-irreducible, where $\lambda$ is a nontrivial $\sigma$-finite measure on $\mathcal{B}(S)$ independent of $f$.

Remark 3.2. (a) Assumption C is taken from [14] and is used to establish the average reward optimality equation. (b) Assumption C is similar to the uniform $w_1$-exponential ergodicity hypothesis for discrete-time MDPs; see [8, 22], for instance. (c) Sufficient conditions, as well as examples, for verifying Assumption C are given in [6, 16, 19]. (d) Under Assumptions A, B, and C, for each $f \in F$, the Markov process $\{x_t\}$ with transition rates $q(\cdot \mid x,f)$ has a unique invariant probability measure $\mu_f$ such that

$$\int_S \mu_f(dx)\, q(D \mid x,f) = 0 \quad \text{for each } D \in \mathcal{B}(S). \tag{3.5}$$

(e) As in [9], for any given stationary policy $f \in F$, we consider two functions in $B_{w_1}(S)$ to be equivalent, and do not distinguish between them, if they are equal $\mu_f$-almost everywhere (a.e.). In particular, if $u(x) = 0$ $\mu_f$-a.e., then the function $u$ is taken to be identically zero.

Under Assumptions A, B, and C, we obtain several lemmas that are needed to prove our main results.

Lemma 3.3. Suppose that Assumptions A, B, and C hold, and let $f \in F$ be any stationary policy. Then the following facts hold.

(a) For each $x \in S$, the function

$$h_f(x) := \int_0^{\infty} \left[ E^{f}_x\big(r(x_t,f)\big) - g(f) \right] dt \tag{3.6}$$

belongs to $B_{w_1}(S)$, where $g(f) := \int_S r(y,f)\,\mu_f(dy)$ and $w_1$ is as in Assumption A.

(b) The pair $(g(f), h_f)$ satisfies the Poisson equation

$$g(f) = r(x,f) + \int_S h_f(y)\,q(dy \mid x,f) \quad \forall x \in S, \tag{3.7}$$

for which the $\mu_f$-expectation of $h_f$ is zero, that is, $\mu_f(h_f) = 0$.
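In the finite-state case, Lemma 3.3 reduces to the familiar policy evaluation step of the PIA: for a stationary policy $f$, solve the Poisson equation (3.7) together with the normalization $\mu_f(h_f) = 0$. The sketch below does this by plain linear algebra; the finite model and the names q, r, f are the same illustrative placeholders as in the earlier sketches, not constructs from the paper.

```python
import numpy as np

def evaluate_policy(q, r, f):
    """Solve the Poisson equation g(f) = r(x, f) + (Qf h)(x) with mu_f(h) = 0
    for a stationary policy f on a finite CTMDP (illustrative sketch).

    Returns (g, h, mu): average reward, bias function, invariant measure.
    """
    n = q.shape[1]
    Qf = np.stack([q[f[x], x, :] for x in range(n)])
    rf = np.array([r[x, f[x]] for x in range(n)])
    # Invariant probability measure mu_f: mu Qf = 0, mu 1 = 1 (cf. (3.5)).
    A = np.vstack([Qf.T, np.ones(n)])
    mu, *_ = np.linalg.lstsq(A, np.concatenate([np.zeros(n), [1.0]]), rcond=None)
    g = float(mu @ rf)
    # Poisson equation (3.7): Qf h = g - rf, pinned down by mu_f(h) = 0.
    B = np.vstack([Qf, mu])
    c = np.concatenate([g - rf, [0.0]])
    h, *_ = np.linalg.lstsq(B, c, rcond=None)
    return g, h, mu

# Usage with the toy arrays q, r from the first sketch:
# g, h, mu = evaluate_policy(q, r, f=np.array([0, 1, 0]))
# print(g, h, mu)   # h satisfies g - r(x, f) = sum_y h(y) q({y}|x, f) up to round-off
```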

Related articles

Exponential Lower Bounds for Policy Iteration

We study policy iteration for infinite-horizon Markov decision processes. It has recently been shown that policy-iteration-style algorithms have exponential lower bounds in a two-player game setting. We extend these lower bounds to Markov decision processes with the total reward and average-reward optimality criteria.

Online Markov decision processes with policy iteration

The online Markov decision process (MDP) is a generalization of the classical Markov decision process that incorporates changing reward functions. In this paper, we propose practical online MDP algorithms with policy iteration and theoretically establish a sublinear regret bound. A notable advantage of the proposed algorithm is that it can be easily combined with function approximation, and thu...

Convergence of Simulation-Based Policy Iteration

Simulation-based policy iteration (SBPI) is a modification of the policy iteration algorithm for computing optimal policies for Markov decision processes. At each iteration, rather than solving the average evaluation equations, SBPI employs simulation to estimate a solution to these equations. For recurrent average-reward Markov decision processes with finite state and action spaces, we provide...

Optimal Control of Ergodic Continuous-Time Markov Chains with Average Sample-Path Rewards

In this paper we study continuous-time Markov decision processes with the average sample-path reward (ASPR) criterion and possibly unbounded transition and reward rates. We propose conditions on the system’s primitive data for the existence of ε-ASPR-optimal (deterministic) stationary policies in a class of randomized Markov policies satisfying some additional continuity assumptions. The proof o...

Approximate Policy Iteration for Semi-Markov Control Revisited

The semi-Markov decision process can be solved via reinforcement learning without generating its transition model. We briefly review the existing algorithms based on approximate policy iteration (API) for solving this problem for discounted and average reward under the infinite horizon. API techniques have attracted significant interest in the literature recently. We first present and analyze a...

Batch Policy Iteration Algorithms for Continuous Domains

This paper establishes the link between an adaptation of the policy iteration method for Markov decision processes with continuous state and action spaces and the policy gradient method when the differentiation of the mean value is directly done over the policy without parameterization. This approach allows deriving sound and practical batch Reinforcement Learning algorithms for continuous stat...

Journal: Abstract and Applied Analysis

Publication year: 2010